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Abstract. Beginning with a discussion of R. A. Fisher's early writ- 
ten remarks that relate to dimension reduction, this article revisits 
principal components as a reductive method in regression, develops 
several model-based extensions and ends with descriptions of general 
approaches to model-based and model-free dimension reduction in re- 
gression. It is argued that the role for principal components and related 
methodology may be broader than previously seen and that the com- 
mon practice of conditioning on observed values of the predictors may 
unnecessarily limit the choice of regression methodology. 

Key words and phrases: Central subspace, Grassmann manifolds, in- 
verse regression, minimum average variance estimation, principal com- 
ponents, principal fitted components, sliced inverse regression, sufficient 
dimension reduction. 
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1. INTRODUCTION 

R. A. Fisher is responsible for the context and 
mathematical foundations of a substantial portion of 
contemporary theoretical and applied statistics. One 
purpose of this article is to consider insights into the 
long-standing and currently prominent problem of 
"dimension reduction" that may be available from 
his writings. Two papers are discussed in this re- 
gard. Fisher's pathbreaking 1922 paper on the the- 
oretical foundations of statistics (Section 1.1), and 
a later applications paper on the yield of wheat at 



R. Dennis Cook is Professor, School of Statistics, 
University of Minnesota, 224 Church Street S. E., 
Minneapolis, Minnesota 554-55, USA e-mail: 
dennis @stat. umn. edu. 

^This article is the written version of the 2005 Fisher lec- 
ture delivered at the Joint Statistical Meetings in Minneapo- 
lis, Minnesota. 

^Discussed in 10.1214/088342307000000041, 
10.1214/088342307000000050 and 
10.1214/088342307000000069; rejoinder at 
10.1214/088342307000000078. 



This is an electronic reprint of the original article 
published by the Institute of Mathematical Statistics in 
Statistical Science, 2007, Vol. 22, No. 1, 1-26. This 
reprint differs from the original in pagination and 
typographic detail. 



Rothamsted (Section 1.2). The discussion of these 
articles makes liberal use of quoted material, in an 
effort to preserve historical flavor and reflect Fisher's 
style. 

Principal component analysis is one of the oldest 
and best known methods for reducing dimension- 
ality in multivariate problems. Another purpose of 
this article is to provide an exposition on princi- 
pal components as a reductive method in regres- 
sion, with emphasis on connections to known re- 
ductive methods and on the development of a new 
method — principal fitted components (PFC) — that 
may outperform principal components. The discus- 
sion will be related to and guided by Fisher's writ- 
ings as much as the nature of the case permits. In 
particular, the philosophical spirit of the methods 
discussed in this article derives largely from Fisher's 
notion of sufficiency. 

We briefly review principal components in Sec- 
tion 2, and starting in Section 3 we focus on princi- 
pal components in regression. Principal fitted com- 
ponents are introduced in Section 4. In Sections 5-7 
we expand the themes of Sections 3 and 4, grad- 
ually increasing the scope of dimension reduction 
methodology. A general model-based paradigm for 
dimension reduction in regression is described in 
Section 8.1. In keeping with the Fisherian theme of 
this article, the development is model-based, but in 
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Section 8.2 we describe its relation to recent ideas for 
model-free reductions. The Appendix contains jus- 
tification for propositions and other results. Empha- 
sis is placed on ideas and methodological directions, 
rather than on the presentation of fully developed 
methods. 

Reduction by principal components has been pro- 
posed as adjunct methodology for linear regression. 
It does not arise as a particular consequence of the 
model itself but is used to mitigate the variance in- 
flation that often accompanies collinearities among 
the predictors. Indeed, while collinearity is the main 
and often the only motivation for use of principal 
components in regression, it will play no role in the 
evolution of the methods in this article. It is ar- 
gued in Sections 5.1, 5.2 and 8 that the utility of 
principal component reduction is broader than pre- 
viously seen and need not be tied to the presence of 
collinearity. This conclusion is a consequence of pos- 
tulating inverse regression models that lead to prin- 
cipal components and principal fitted components 
as maximum likelihood estimators of reductive sub- 
spaces. 

Ordinary least squares (OLS) is widely recognized 
reasonable first method of regression when the 
response and predictors follow a nonsingular multi- 
variate normal distribution. Nevertheless, examples 
are given in Sections 5 and 6.4 to demonstrate that 
in this context reduction by principal components 
and principal fitted components may dominate OLS 
without invoking collinearity. Sliced inverse regres- 
sion (SIR; Li, 1991) is a relatively recent reductive 
method for regression that has received notable at- 
tention in the literature. New drawbacks of SIR will 
emerge from its relation to principal fitted compo- 
nents described in Sections 6.3 and 7.5. It is demon- 
strated by example that the method of principal fit- 
ted components can dominate SIR as well as OLS in 
the multivariate normal setting. The notions of prin- 
cipal components and principal fitted components 
are presented here as reductive frames not neces- 
sarily tied to normality. Their construction in the 
context of exponential families is sketched in Sec- 
tion 3.3 and elsewhere. 

Conditioning on the observed values of the pre- 
dictors is a well-established practice in regression, 
even when the response and the predictors have a 
joint distribution. However, as a consequence of the 
exposition on principal components and related re- 
ductive methods, we argue in Section 8 that this 



practice may unnecessarily restrict our choice of re- 
gression methodology. It may be advantageous in 
some regressions to make explicit use of the vari- 
ability in the predictors through their multivariate 
inverse regression on the response. 

1.1 Fisher, 1922 

Much of contemporary statistical thought began 
with Fisher's 1922 article "On the mathematical 
foundations of theoretical statistics," which set forth 
a new conceptual framework for statistics, includ- 
ing definitions of Consistency, Efficiency, Likelihood, 
Specification and Sufficiency. Fisher also introduced 
the now familiar terms "statistic," "maximum like- 
lihood estimate" and "parameter." The origins of 
this remarkable work, which in many ways is re- 
sponsible for the ambient texture of statistics to- 
day, were traced by Stigler (1973, 2005), who em- 
phasized that Fisher was the first to use the word 
"parameter" in its modern context, 57 times in his 
1922 article (Stigler, 1976). More than any other 
single notion, this word reflects the starting point — 
parametric families — for Fisher's constructions. Many 
of his ideas are now common knowledge, to the point 
that he is no longer credited when they are first in- 
troduced in some statistics texts. This paper, per- 
haps more than any other of Fisher's, should be re- 
quired reading in every statistics curriculum. 

Fisher provided a focal point for statistical meth- 
ods at the outset of his 1922 article: 

. . . the objective of statistical methods is 
the reduction of data. A quantity of data. . . 
is to be replaced by relatively few quanti- 
ties which shall adequately represent. . . the 
relevant information contained in the orig- 
inal data. 

Since the number of independent facts sup- 
plied in the data is usually far greater than 
the number of facts sought, much of the 
information supplied by an actual sample 
is irrelevant. It is the object of the sta- 
tistical process employed in the reduction 
of data to exclude this irrelevant informa- 
tion, and to isolate the whole of the rele- 
vant information contained in the data. 

In these statements Fisher signified the goal of sta- 
tistical methods as a type of dimension reduction, 
the reduced data containing the relevant and only 
the relevant information. The fundamental idea that 
statistical methods deal with the reduction of data 
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did not originate with Fisher, and was probably 
widely understood well before his birth in 1890 (see, 
e.g., Edgeworth, 1884). However, Fisher's approach 
is unique because it changed the course of statistical 
history. 

Fisher identified three distinct problems in his re- 
ductive process: 1. "Problems of Specification," se- 
lecting a parametric family; 2. "Problems of Estima- 
tion," selecting statistics for parameter estimation; 
and 3. "Problems of Distribution," deriving the sam- 
pling distribution of the selected statistics or func- 
tions thereof. Problems of specification and estima- 
tion are both reductive in nature. Surely, selecting 
a finitely and parsimoniously parameterized family 
from the infinity of possible choices is a crucial first 
reductive step that can overshadow any subsequent 
reduction of the data for the purpose of estimation. 

Regarding estimation. Fisher said that once the 
model is specified 

. . . the statistic chosen should summarize 
the whole of the relevant information sup- 
plied by the sample. 

Any operational version of this idea must include 
some way of parsing data into the relevant and ir- 
relevant. For Fisher, the reductive process started 
with a parametric family, targeted information on 
its parameters 6 and was guided by sufficiency: If D 
represents the data, then a statistic t{D) is sufficient 
if 

(1) D\{e,t)r^D\t 

so that t contains all of the relevant information 
about 9. Subsequent commentaries on sufficiency by 
Fisher and others address existence, minimal suffi- 
ciency, specializations and variations, relation to the 
method of maximum likelihood, and the factoriza- 
tion theorem. 

Fisher was quite specific on how to approach the 
second reductive step, estimation, but was less so 
regarding the overarching first reductive step, spec- 
ification. In stating that problems of specification 
". . . are entirely a matter for the practical statisti- 
cian," Fisher positioned them as a nexus between 
applied and theoretical statistics. A model is re- 
quired before proceeding to problems of estimation, 
and model specification falls under the purview of 
the practical statistician. He also offered the follow- 
ing helpful but nonprescriptive advice: 



... we may know by experience what forms 
are likely to be suitable, and the adequacy 
of our choices may be tested a posteriori. 
We must confine ourselves to those forms 
which we know how to handle. ..." 

In these statements Fisher acknowledged a role for 
statisticians as members of scientific teams, antic- 
ipated the development of diagnostic methods for 
model criticism and recognized a place for off-the- 
shelf models. Interest in diagnostic methods for re- 
gression was particularly high from a period starting 
near the time of Fisher's death in 1962 and ending 
in the late 1980s (Anscombe, 1961; Anscombe and 
Tukey, 1963; Box, 1980; Cook and Weisberg, 1982; 
Cook, 1986). Fisher also linked model complexity 
with the amount of data, and evidently did not re- 
quire that models be "true," 

More or less elaborate forms will be suit- 
able according to the volume of the data. 
Evidently these are considerations the na- 
ture of which may change greatly during 
the course of a single generation. 

Fisher's first reductive step is the most challenging 
and elusive. Indeed, Fisher's views launched a de- 
bate over modeling that continues today. Writing 
on Fisher's discovery of sufficiency, Stigler's (1973) 
introduction began with the sentence: 

Because Fisher's concept of sufficiency de- 
pends so strongly on the assumed form 
of the population distribution, its impor- 
tance to applied statistics has been ques- 
tioned in recent years. 

Box's (1979) memorable statement that "All mod- 
els are wrong, but some are useful" has been taken 
to imply, perhaps from a position of devil's advo- 
cate, that D is the only sufficient statistic. At least 
two Fisher lectures, Lehmann in 1988 (Lehmann, 
1990) and Cox in 1989 (Cox, 1990), were on the is- 
sue of model specification. And then there are the 
modeling cultures of Breiman (2001) and McCul- 
lagh's (2002) rather esoteric answer to the question 
"What is a statistical model?" Fourteen years after 
his 1922 article, Fisher suggested that it might be 
possible to develop inductive arguments for model 
specification, a promise that has yet to be realized: 

Clearly, there can be no operation prop- 
erly termed 'estimation' until the param- 
eter to be estimated has been well de- 
fined, and this requires that the mathe- 
matical form of the distribution shall be 
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given. Nevertheless, we need not close our 
eyes to the possibility that an even wider 
type of inductive argument may some day 
be developed, which shall discuss methods 
of assigning from the data the functional 
form of the population (Fisher, 1936). 

1.2 Fisher, 1924 

In Fisher's classic 1922 article we see him think- 
ing primarily as a theoretical statistician, while in 
his 1924 paper "III. The influence of rainfall on the 
yield of wheat at Rothamsted" we see him as an ap- 
plied statistician. Having recently developed a new 
framework for theoretical statistics, he might have 
quickly integrated those ideas into his applied work, 
but that does not appear to be the case. He did rely 
on his 1922 article to justify a claim that a skewness 
statistic is the "most efficient statistic" for a test of 
normality (Fisher, 1924, page 103), but otherwise I 
found no clear formal links between the two works. 

The opening issue that Fisher addressed in 1924 
involves sparsity of data in regression, possibly "n < 
p." According to Fisher, large sample regression meth- 
ods are appropriate when the sample size n is much 
larger than the number of predictors p, preferably 
n > 1000 but at least in the hundreds. However, 
the number p of meteorological variables that might 
plausibly affect yield could easily exceed the length 
n of the longest run of available crop records. Fisher 
then faced a dimension reduction problem rather 
like those encountered in the analysis of contempo- 
rary genomics data. He did not provide a general 
solution, but did give a clear opinion on what not 
to do. It was common practice at the time to pre- 
process the potential meteorological predictors by 
plotting them individually against yield to select the 
predictors for the subsequent regression. Fisher was 
critical of this practice, concluding that 

The meteorological variables to be employed 
must be chosen without reference to the 
actual crop record. 

He also reiterated a theme of his 1922 paper, "Rela- 
tionships of a complicated character should be sought 
only when long series of crop data are available." 

The bulk of Fisher's article is devoted to the devel- 
opment of models for the regression of yield on rain- 
fall, models that were painstakingly tailored to the 
substantive characteristics of the problem, as one 
would expect in careful application. Fisher's study 



seems true to his general point that model specifi- 
cation is "entirely a matter for the practical statis- 
tician," and the corollary that it depends strongly 
on context. Regarding dimension reduction, there is 
at least one general theme in Fisher's analysis: An 
"n < p" regression might be usefully transformed 
into an "n > p*" regression, but the methodology 
for doing so should not depend on the response. 

Sufficiency aside (and that is a lot to set aside), 
I found little transportable methodology in Fisher's 
writing that might be applied to contemporary di- 
mension reduction problems in regression. Never- 
theless, dimension reduction was an issue during 
Fisher's era and before. In the next section we turn 
to a widely recognized dimension reduction method 
in statistics — principal components — whose begin- 
nings occurred well before Fisher's time. 

2. INTRODUCTION TO PRINCIPAL 
COMPONENTS 

It is well established that principal components 
are a useful and important foundation for reduc- 
ing the dimension of a multivariate random sam- 
ple, represented here by the vectors Xi, X2, . . . , X„ 
in MP. Letting Ai > • • • > Ap and 7^^, . . . ,7p denote 
the eigenvalues and corresponding vectors of XI = 
Var(X), the population principal components are 
defined as the linearly transformed variables 
{7^ X, . . . ,7jX}. We call the 7j's the principal com- 
ponent (PC) directions. The sample principal com- 
ponents are {7-f X, . . . , 7JX}, where 7^^, . . . , 7^ are 
the eigenvectors (sample PC directions) correspond- 
ing to eigenvalues Ai > • • • > Ap of the usual sample 
covariance matrix S. 

The history of principal components goes back at 
least to Adcock (1878) who wished to 

Find the most probable position of the 
straight line determined by the measured 
coordinates, each measure being equally 
good or of equal weight, (xi, yi), (a;2, 2/2), 
. . . , {xn,yn) of n points 

Adcock identified his solution as the "principal axis," 
or first sample PC direction. Subsequent milestones 
were given in articles by Pearson (1901), Spearman 
(1904) and Hotelhng (1933), but I found no direct 
reference to principal components in Fisher's writ- 
ings. It has been discovered over time that the lead- 
ing principal components, say {7^ X, . . . , 7^X} , k <C 
p, have a number of properties that may be help- 
ful, depending on application-specific requirements. 
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Reviews of principal components were given by Se- 
ber (1984), Christensen (2001) and Jolliffe (2002). 
Gould (1981, Chapter 6) provided an interesting and 
illuminating historical account on the use of princi- 
pal components in the social sciences. New prop- 
erties and methods seem to be communicated of- 
ten in contemporary statistical literature (see, e.g., 
Jong and Kotz, 1999; Tipping and Bishop, 1999; 
Maronna, 2005). 

Principal components have also been studied in 
the context of regression, where X is now the vec- 
tor of predictors that we would like to reduce prior 
to performing a regression with response Y. There 
are several reasons why such reduction may be use- 
ful in practice, including the possibilities of miti- 
gating the effects of collinearity, facilitating model 
specification by allowing visualization of the regres- 
sion in low dimensions (Cook, 1998) and providing 
a relatively small set of predictors on which to base 
prediction or interpretation. Collinearity in partic- 
ular has been the main motivation for using princi- 
pal components as a reductive method in regression. 
One persistent idea is that perhaps we can use the 
leading principal components in place of X with lit- 
tle loss of information: The first few principal com- 
ponents should contain essentially the same infor- 
mation about Y as the original predictors, which is 
in the spirit of Fisher's idea of sufficiency. Kendall 
(1957, page 75), Hocking (1976) and others sug- 
gested using principal components in this way. Scott 
(1992) suggested that the predictors might be "com- 
pressed" by using their principal components prior 
to additive nonparametric modeling. Such proce- 
dures have been questioned because principal com- 
ponents are computed from the marginal distribu- 
tion of X and consequently the leading components 
may have little necessary relation with the response. 
It does seem optimistic to think that the marginal of 
X would necessarily be structured so that the lead- 
ing principal components contain the essential infor- 
mation about the response. Nevertheless, in support 
of the leading principal components, Mosteller and 
Tukey (1977, page 397) turned this cause for concern 
into a desirable goal by posing the question: 

. . . how can we find linear combinations 
of the [predictors] that will be likely, or 
unlikely, to pick up regression from some 
as yet unspecified y? 

It might seem unusual that Mosteller and Tukey 
would begin by asking how to perform linear re- 
duction without reference to the response, but their 



approach is in agreement with Fisher's (1924) point 
that predictors ". . . be chosen without reference to 
the actual crop record." Mosteller and Tukey an- 
swered their question with the philosophical point 
that: 

A malicious person who knew our x's and 
our plan for them could always invent a 
y to make our choices look horrible. But 
we don't believe nature works that way — 
more nearly that nature is, as Einstein put 
it (in German) , "tricky, but not downright 
mean." 

On the other hand. Cox (1968, page 272) wrote in 
reference to reducing X by using the leading princi- 
pal components: 

A difficulty seems to be that there is no 
logical reason why the dependent variable 
should not be closely tied to the least im- 
portant principal component. 

A similar sentiment was expressed by Hotelling (1957), 
Hawkins and Fatti (1984) and others. Evidently, many 
authors did not trust nature in the same way as 
Mosteller and Tukey. Some gave examples of re- 
gressions where Cox's prediction apparently holds, 
with a few trailing principal components contribut- 
ing important information about the response (Jol- 
hffe, 1982; Hadi and Ling, 1998). Are there limits 
to the maliciousness of nature? If nature can pro- 
duce regressions where there is useful information 
about the response in the last principal component 
7p X, can it also produce settings where the regres- 
sion information is concentrated in a few of the orig- 
inal predictors, but is spread evenly across many of 
the principal components, making analysis on the 
principal component scale harder than analysis in 
the original scale? This possibility is relevant in re- 
cent proposals to ignore the hierarchical structure of 
the principal components and use instead a general 
subset M of them when developing linear regres- 
sion models (Hwang and Nettleton, 2003). In this 
regard, Jolliffe (2002, page 177) commented that 
"... the choice of M for PC regression remains an 
open question." 

On balance, the role for principal components in 
regression seems less clear-cut than their role in re- 
ducing X marginally. The advantages that princi- 
pal components enjoy in the multivariate setting, 
where the marginal distribution of X is of primary 
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interest, may be of limited relevance in the regres- 
sion context, where the conditional distribution of 
y|X is of interest. It seems particularly appropri- 
ate to question the usefulness of principal compo- 
nents when X is fixed and controlled by the ex- 
perimenter. In some experimental designs I] is pro- 
portional to the identity, with the consequence that 
there is clearly no useful relation between its eigen- 
structure and the regression. And yet interest in us- 
ing principal components for dimension reduction 
in regression has persisted, notably in the analy- 
sis of microarrays where principal components have 
been called "eigengenes" (Alter, Brown and Bot- 
stein, 2000). Chiaromonte and Martinelli (2002) and 
L. Li and H. Li (2004) used principal components for 
preprocessing microarray data, allowing subsequent 
application of other dimension reduction methodol- 
ogy that requires n> p. Acknowledging that prepro- 
cessing by principal components might be a viable 
alternative, Bura and Pfeiffer (2003) used marginal 
f-tests from the regression of the response on each 
predictor for prior selection of genes, a method to 
which Fisher would likely have objected. 

3. PRINCIPAL COMPONENTS IN 
REGRESSION 

In the rest of this article we consider only regres- 
sions where X is random, so Y and X have a joint 
distribution, since a different structure seems nec- 
essary when X is fixed and subject to experimental 
control. In this context we may pursue dimension re- 
duction through the conditional distribution of y |X 
(forward regression) the conditional distribution of 
X|y (inverse regression), or the joint distribution 
of (y, X). All three settings have been used in ar- 
guments for the relevance of principal components 
and related methodology. Forward regression with 
fixed predictors is perhaps the most common, par- 
ticularly in early articles. Helland (1992) and Hel- 
land and Alm0y (1994) assumed that (y, X) follows 
a multivariate normal in their development of "rel- 
evant components." Oman (1991) used an inverse 
model for X|y in a heuristic argument that the co- 
efficient vector a in a linear regression model for 
y|X should fall in or close to the space spanned by 
the first few principal components. Similarly, in their 
development of multiple-shrinkage principal compo- 
nent regression, George and Oman (1996) used a 
model for the joint distribution of (y, X) to reach 
the same conclusion. 



In this article, we concentrate on X|y, although 
the goal is still to reduce the dimension of X with lit- 
tle or no loss of information on y|X. Model-based 
forward regression analyses traditionally condition 
on the observed values of the predictors, a charac- 
teristically Fisherian operation, even if X is random 
(see Savage, 1976, page 468, and Aldrich, 2005, for 
further discussion of this point). Nevertheless, the 
conditional distribution of X|y may provide a bet- 
ter handle on reductive information since it can be 
linked usefully to y|X (Proposition 1), and I found 
nothing in Fisher's writings that would compel con- 
sideration of only forward regressions. To facilitate 
the exposition, let Xy denote a random variable dis- 
tributed as X.\(Y = y). Subspaces are indicated as 
where the argument is a matrix whose columns 
span the subspace, and U_ILV|W means that the 
random vectors U and V are conditionally indepen- 
dent given any value for the random vector W. The 
notation R"^'' stands for the space of real matrices 
of dimension a x b, and means the space of real 
vectors of length a. 

3.1 A First Regression Model Implicating 
Principal Components 

Consider the following multivariate model for the 
inverse regression of X on y: 

(2) X,y = /X + TiJy + as. 

Here fxeW,Te WP""^, d<p, T^T = I^, > and d 
is assumed to be known. To emphasize the condi- 
tional nature of the model, y is used to index obser- 
vations in place of the more traditional "i" notation. 
The coordinate vector Uy G is an unknown func- 
tion of y that is assumed to have a positive definite 
sample covariance matrix and is centered to have 
mean 0, J2y^y = Oj but is otherwise unconstrained. 
The centering of Vy in the sample is for later conve- 
nience and is not essential. The error vector e G is 
assumed to be independent of Y, and to be normally 
distributed with mean and identity covariance ma- 
trix. This model specifies that, after translation by 
the intercept fi, the conditional means fall in the 
d-dimensional subspace 5r spanned by the columns 
of r. The vector Vy contains the coordinates of the 
translated conditional mean E(Xj^) — fi relative to 
the basis T. The mean function is quite flexible, even 
when d is small, say at most 3 or 4. Aside from the 
subscript y, nothing on the right-hand side of this 
model is observable. 
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While the mean function for model (2) is permis- 
sive, the variance function is restrictive, requiring 
essentially that the predictors be in the same scale 
and that, conditional on Y, they be independent 
with the same variance. Nevertheless, this model 
provides a useful starting point for our considera- 
tion of principal components, and may be appropri- 
ate for some applications dealing with measurement 
error, image recognition, microarray data and cali- 
bration (Oman, 1991). And it may serve as a useful 
first model when the data are not plentiful, following 
the spirit of Fisher's recommendations. 

Model (2) is similar in form to the functional model 
for multivariate problems studied by Anderson (1984), 
but there are important conceptual differences. Model 
(2) is a formal statement about the regression of X 
on Y, with the Vy^s being fixed because we con- 
dition on Y, in the same way forward regression 
models are conditioned on X. Functional models for 
multivariate data postulate latent fixed effects with- 
out conditioning. Model (2) also resembles the tra- 
ditional model for factor analysis, but again there 
are important differences. There is no response in a 
factor analysis model and no conditioning. Instead 
V, the vector of common factors, is assumed to be 
jointly distributed with the error e, with conditions 
imposed on the joint distribution to ensure identi- 
fiability. Additionally, rotation to obtain a "mean- 
ingful" estimate of T is often of interest in factor 
analysis, while in this article interest in T does not 
extend beyond Sr- Model (2) will lead in directions 
that are not available when considering functional 
or factor-analytic models for multivariate data. 

The following proposition connects the inverse re- 
gression model (2) with the forward regression of Y 
on X. 

Proposition 1. Under the inverse model (2), 
the distribution o/y|X is the same as the distribu- 
tion o/y|r"^X for all values ofJi-. 

According to this proposition, X can be replaced 
by the sufficient reduction F^^X without loss of in- 
formation on the regression of y on X, and without 
specifying the marginal distribution of Y or the con- 
ditional distribution of y|X. Here "sufficient reduc- 
tion" is used in the same spirit as Fisher's "sufficient 
statistic." One difference is that sufficient statistics 
are observable, while sufficient reductions may con- 
tain unknown parameters, T in this instance, and 
thus need to be estimated. A reasonable estimate of 
r is needed for the sufficient reduction F^^X to be 
useful in practice. 



3.2 Estimation via the Method of Maximum 
Lil<elihood 

Even with the previously imposed constraint that 
r^^r = Jrf, F and Vy are not simultaneously estimable 
under model (2) since, for any orthogonal matrix 
O G M'^^'^, we can always rewrite Tuy = (FO)(0^i>'y), 
leading to a different factorization. However, the 
reductive subspace Sr = span(F) is estimable, and 
that is the focus of our inquiry. The matrix F is of 
interest only by virtue of the subspace generated by 
its columns, the condition F^F = Id being imposed 
for convenience. This implies that two reductions 
F^X and AF^X that are connected by a full-rank 
linear transformation A are regarded as equivalent 
for present purposes. To emphasize this distinction 
we will refer to the parameter space for St as a 
Grassmann manifold Qpxd of dimension d in M^. 

Consider a function G{A) that is defined on the 
set oi p X d matrices with d <p and A-'^ A = I^, and 
has the property that G(AO) = G{A) for any or- 
thogonal matrix O G M'^^'^. The function G(A) de- 
pends only on the span of its argument: If span(A) = 
span(B), then G(A) = G(B). The set of d-dimensional 
subspaces of is called a Grassmann manifold, a 
single point in a Grassmann manifold being a sub- 
space. A Grassmann manifold is the natural param- 
eter space for the F parameterization in this article. 
While no technical use will be made of Grassmann 
manifolds, the terminology is proper in this context 
and it may serve as a reminder that only the sub- 
space Sr is of interest. For background on Grass- 
mann manifolds, see Edelman, Arias and Smith (1998) 
and Chikuse (2003). 

Estimation of a sufficient reduction might be based 
on the method of moments, Bayesian considerations 
or a concern for robustness, but staying with Fisher 
we will use the method of maximum likelihood. As- 
suming that the data consist of a random sample of 
size n from (Y, X) , the maximum likelihood estima- 
tor of Sr can be constructed by holding F and 
fixed at possible values G and and then maxi- 
mizing the log likelihood over fi and Vy. This yields 
/i = X and, in the absence of replicate y's, a sepa- 
rate value for each of the n d-dimensional vectors Uy 
as a function of (G,s^): i'y{G,s'^) = G'^{'Xy — X). 
Substituting back, the partially maximized log like- 
lihood Mpc is then, apart from constants, 

Mpc(G,s2) 

= -{np/2)log{s^) 
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- (1/2.2) ^ _ X _ p^(x^ _ x)||2 
y 

= -(np/2)log(s2) - (n/2s2)trace(EQG), 

where = GG-^ is the projection onto Sc and 
Qg = Ip — Pg- Although in this and later likelihood 
functions we use G as an argument, the function it- 
self depends only on Sc and thus maximization is 
over the Grassmann manifold Qpxd- Recalling that 
{Xj} and {7j} are the eigenvalues and eigenvectors 
of S, it follows that the maximum likelihood esti- 
mator Sy of St is 

5r = span{7i,...,7rf}, 

and that the estimator of 0"^ is a"^ = J2^=d+i '^j/P- 
A sufficient reduction is thus estimated by the first d 
sample principal components, and for this reason we 
will refer to model (2) as a PC regression model. If 
there is replication in the observed y's, as may hap- 
pen if Y is supported on a finite number of points, 
then the principal components are to be computed 
from the usual between-class covariance matrix. 

In the absence of replication in the y's, the ob- 
served responses play no role other than acting col- 
lectively as a conditioning argument. The same re- 
duction would be obtained with a continuous multi- 
variate response. In fact, it is not necessary for the 
response to be observed and thus model (2) is one 
possible route to satisfying Hosteller and Tukey's 
desire to reduce X for an ". . . as yet unspecified 
y," and is in accord with Fisher's requirement that 
". . . the variables to be employed must be chosen 
without reference to the actual crop record." How- 
ever, model (2) does not hold for all potential re- 
sponses, but requires that the predictors be condi- 
tionally independent with common variance. 

T|}e fact that all full-rank linear transformations 
Ar X of the first d sample principal components 

r X are equivalent reductions in the context of 
model (2) may cast doubt on the usefulness of post 
hoc reification of principal components. The safest 
interpretations are perhaps the ones that suggest 
regression mechanisms that can be verified indepen- 
dently of the principal components themselves. 

As mentioned previously, the PC model (2) may 
be appropriate for some applications, but there is 
also value in the paradigm it suggests for extension 
to other settings. In the next section we sketch at 
the conceptual level how the ideas behind (2) can be 
applied to exponential families, returning to normal 
models in Section 4. 



3.3 Extensions to Exponential Families 

As in the PC model, we assume that the predic- 
tors are independent given Y, but instead of requir- 
ing Xy to be normally distributed we assume that 
the jth conditional predictor Xyj is distributed ac- 
cording to a one-parameter exponential family with 
density or mass function of the form 

(3) fj{x\r]yj,Y = y) = aj{r]yj)bj{x) e^p{xr]yj). 

We also assume that the natural parameter ijyj fol- 
lows the model for the conditional mean E(Xy) in 
the normal case, 

Vyj = l^j +ljl^y, i = l,---,P, 

where fij is the jth coordinate of and "yj is the jth 

row of r. Under this setup T-^X is again a sufficient 
reduction: 

Proposition 2. Under the inverse exponential 
model (3), the distribution of y|X is the same as 
the distribution o/y |r'^X for all values of^. 

Given y, fi and T the likelihood for Vy is con- 
structed from the products of (3) for j = I,..., p. 
This corresponds to fitting a generalized linear model 
with offsets /x^, "predictors" 7^- and regression co- 
efficient Vy. The remaining parameters /x and Sr G 
Gpxd can then be estimated by combining these in- 
termediate partially maximized likelihoods over y. 
The essential point here is that generalized PC mod- 
els can be constructed from the normal PC model 
(2) in the way that generalized linear models are 
constructed from linear models. Marx and Smith 
(1990) developed a method of principal component 
estimation for generalized linear models. Their ap- 
proach, while recognizing issues that come with gen- 
eralized linear models, seems quite different from the 
approach suggested here. 

For example, if the coordinates of X^ are inde- 
pendent Bernoulli random variables, then we can 
express the model in terms of a multivariate logit de- 
fined coordinate-wise as multlogitj^ = fx + Ti>y, where 
the right-hand side is the same as the mean function 
for the PC model (2). Since the predictors are con- 
ditionally independent, we can write 

Pr(X = x|y = y) = n Pj{yrqj{y)'-^\ 

where x = (xj), pj{y) = Pr(Aj = Xj\y), qj{y) = 1 - 
Pj{y) and ^og{pj/qj) = fij+jji^y. The log likelihood 
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is therefore 

y lj=i J 

which is to be maximized over ^ G M^, Vy^ and 
G Gpxd- By analogy with the normal case, this 
leads to principal components for binary variables, 
but they are not computed from the eigenvectors 
of a covariance matrix and they seem distinct from 
the recent proposal by de Leeuw (2006) for marginal 
reduction of X. 

4. PRINCIPAL FITTED COMPONENTS 

In the PC model (2) no direct use is made of the 
response, which plays the role of an implicit condi- 
tioning argument. In principle, once the response is 
known, we should be able to tailor our reduction to 
that response. One way to adapt the reduction for a 
specific response is by modeling Vy. This can be fa- 
cilitated by graphical analyses when the response is 
bivariate or univariate. Recalling that Xj denotes 
the j'th predictor in X, we can gain information 
from the data on the mean function E(Xj|y) by in- 
vestigating the p two- or three-dimensional inverse 
response plots of Xj versus y, j = 1, . . . as de- 
scribed by Cook and Weisberg (1994, Chapter 8) 
and Cook (1998, Chapter 10). While such a graphi- 
cal investigation might not be practical if p is large, 
it might be doable when p is in the tens, and cer- 
tainly if p is less than, say, 25. 

Assume then that Uy = pfy, where (3 G M^^*", d < 
r, has rank d and fy G M'' is a known vector- valued 
function of the response with J2y — 0- For exam- 
ple, if it is decided that each inverse mean function 
E(Xj|y = y) can be modeled adequately by a cubic 
polynomial in y, then fy equals {y,y'^,y^)'^ minus its 
sample average. When Y is univariate and graphi- 
cal guidance is not available, iy could be constructed 
by first partitioning the range of Y into h = r + 1 
"slices" or bins H/^, and then setting the kth coor- 
dinate fyk of fy to 

(4) fyk = J{y ^Hk) - rik/n, k=l,...,r, 

where J is the indicator function and is the num- 
ber of observations falling in Hk. The jth coordinate 
I'yj of Vy is then modeled as a constant in each slice 

r r 

^yj = E f^jkfyk = E l^3k{J{y G Hk) - Uk/n), 

k=l k=l 



where Pjk is the jfcth element of /3. Each coordinate 
of the vector Uy = (3iy is now a step function that is 
constant within slices. Many other possibilities for 
basis functions are available in the literature. For 
example, we might adapt a classical Fourier series 
form (see, e.g., Eubank, 1988, page 82) and set fy = 
Sy-S, where 

gy = (cos(27ry),sin(27ry),..., 

cos(27rA;y), sin(27r/cy))"^ 

with r = 2k. We will use the slice basis function (4) 
later in this article because it leads to a connec- 
tion with sliced inverse regression (SIR, Li, 1991). 
No claim is made that this is a generally reason- 
able nonparametric choice for iy. However, the slice 
basis function can be used to allow for replication 
in model (2), with the slices corresponding to the 
unique values of y. The parameterization Vy = (Bfy 
can be used with exponential families as described 
in Section 3.3 and the PC model (2). 

Substituting Vy = (Bfy into model (2) we obtain 
the new model 

(5) Xj, = + Tf3iy + ae, 

for which F^^X is still a sufficient reduction. Let X 
denote the n x p matrix with rows (X.y — X)^, let 
F denote the n x r matrix with rows fj, and let 

X = PpX denote the n x p matrix of centered fit- 
ted values from the multivariate linear regression of 
X on fy, including an intercept. Holding F and a 
fixed at G and s, and maximizing the likelihood over 
fi and p, we obtain /i = X, ^ = G^X^F(F^F)-^ 
and the partially maximized log likelihood (see Ap- 
pendix A. 3 for details) 

Mppc(G,s2) 

(6) =(-np/2)log(s2) 

- (n/2s^){ trace [f;] - trace [Pc^fit]}, 

where Xlgt = X-^X/n is the sample covariance ma- 
trix of the fitted values. The likelihood again de- 
pends only on Sg. The rank of Sgt is at most r and 
typically rank(Sfit) = r. In any event, we assume 
that rank(Sfit) > d. The likelihood is then maxi- 
mized by setting Sr equal to the span of eigenvectors 
(j>i, . . . , (pij^ corresponding to the largest d eigenvalues 
Xf\i = l,...,d, of Efit. The presence of replication 
in the y's does not affect this estimator. 
We call 0i"X, . . . , 0p X principal fitted components 

(PFC), and call the associated eigenvectors cj^i, . . . ,cj)^ 
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PFC directions. The corresponding estimate of scale 
is = (ELi - Ef=i Af A sufficient reduc- 
tion under model (5) is then estimated by the first 
d PFC's. We call model (5) a PFC model to distin- 
guish it from the PC model (2). 

5. COMPARING PC'S AND PFC'S 

In the PC model (2) there are {n — l)d i/-parameters 
to be estimated, while in the PFC model (5) the 
corresponding number is just dr, which does not in- 
crease with n. Consequently, assuming that (5) is 
reasonable, we might expect it to yield more accu- 
rate estimates than (2). 

5.1 Simulating Sy 

A small simulation using multivariate normal (Y, X) 
was conducted to obtain first insights into the op- 
erating characteristics of principal components and 
principal fitted components. Here and in all other 
simulations we restrict F £ M^* (d = 1) because this 
allows straightforward comparisons with forward OLS 
The two component methods are applicable when 
d> 1, but then the OLS fit of y on X must neces- 
sarily miss d—\ directions. First, Y was generated 
as a normal random variable with mean and vari- 
ance CTy, and then Xj^ was generated according to 
the inverse model 

(7) X,y = ry + (j£, 

with r = (1, 0, . . . , 0)"^, p = 10 and 0- > 0. The for- 
ward regression y|X follows a textbook normal lin- 
ear regression model, 

(8) y = ao + Q^x + cry|xe, 

where x denotes an observed value of X, (Ty|x is 
constant, e is a standard normal random variable 
and, as indicated by Proposition 1 and its extension 
to PFC models, span(Q;) = span(r). Thus, E(y|X) 
depends on X via F-^X, which is equal to Xi for 
this simulation. Let a denote the OLS estimator of 
ex. We consider three ways of estimating 5r, each 
based on a correct model: OLS, using span(S); PC, 
using span(7]^); and PFC, using span(0^) from the 
fit of model (5) with iy = y — y. 

Each data set was summarized conveniently by 
computing the angle between T and each of 7i (PC), 
c()i (PFC) and a (OLS). Other summary measures 
were tested but they left the same qualitative im- 



pressions as the angle. The mean squared error (9) 
used in Figure 1(d) and discussed in Section 5.2 is 
one instance of an alternative summary measure. 
Three aspects of the simulation model were varied, 
the sample size n, the conditional error standard de- 
viation a and the marginal standard deviation of Y, 
ay. It might be expected that PFC will do better 
than PC in this simulation. While both methods are 
based on a correct model, the PFC model is more 
parsimonious. On the other hand, it could be more 
difficult to anticipate a relation between these in- 
verse methods and OLS, which is based on a correct 
forward model. 

Shown in Figure 1(a) are average angles taken over 
500 replications versus n, with a = ay = 1. On the 
average, the OLS vector was observed to be a bit 
closer to Sr than the PC vector ■Ji, except for small 
samples, when -ji was the better of the two. More 
importantly, the PFC vector 0^ was observed to be 
more accurate than the other two estimators at all 
sample sizes. The difference between the PFC and 
OLS estimators can be increased by varying ay, as 
is apparent from Figure 1(b). 

Figure 1(b) shows the results of a second series of 
simulations at various values of ay with n = 40 and 
(7 = 1. The method of principal fitted components 
is seen to be superior for small values of ay, while 
it is essentially equivalent to principal components 
for large values. Perhaps surprisingly, the OLS esti- 
mator is clearly the worst method over most of the 
range of ay. Figure 1(c) shows average angles as a 
varies with n = 40 and ay = 1. Again, PFC is seen to 
be the best method, with the relative performance 
of PC and PFC depending on the value of a. The 
accuracy of the PC or PFC estimates of St should 
improve as n increases, or as ay increases or as a 
decreases. The results in Figure 1 agree with this 
expectation. 

To explain the relative behavior of the OLS es- 
timates in the simulation results for model (7), let 
i2 = cjf/((jf + 0-2). Then it can be shown that a = 
RT, that ■y/n(S — a) is asymptotically normal with 
mean and covariance matrix 

Var(a) = RQr + R{1 - R)Pr, 

and that cx'^[Yar{a)]^^a. = R/{1 — R). As ay oo, 
i? — > 1 but Var(S) > RQr - Thus, Var(S) is bounded 
below, and this explains the behavior of the OLS es- 
timates in Figure 1(b). On the other hand, as a ^ 
oo, a —>-0 and Var(S) 0, but a-'"[Var(S)]~-'^a 
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(c) Angle vs. a (d) \/MSE vs, ay 

Fig. 1. Simulation results for model (7). (a)-(c) display average simulation angles between the estimated and the true 
direction versus (a) sample size with ay = a = 1; (b) ay with n — 40, a — 1; and (c) a with n = 40, ay = 1- (d) is the square 
root of the standardized MSE of prediction versus ay with n = 40 and a = 1. 



as well. Consequently, q — > faster than the "stan- 
dard deviation" of its estimates and the performance 
of a must deteriorate. These results can be extended 
straightforwardly to model (2) with d=l, but then 
the behavior of the OLS estimator will depend on 
Var(i>'y) and Cov{uy,Y). 

Letting denote the A;th element of the T in 
model (7), the marginal correlation between the jih 
and kth predictors, j ^ k, is 

^k^jCTy 

If Tj and are both nonzero, then \pjk\ — > 1 as 
(Ty ^ oo with (7^ fixed. However, F = (1, 0, . . . , 0)^ 



in the version of model (7) used to produce Fig- 
ure 1(b), with the consequence that the predictors 
are both marginally and conditionally independent. 
Thus, the bottoming out of the OLS estimator in 
Figure 1(b) for large values of cry and in Figure 1(c) 
for small values of a has nothing to do with collinear- 
ity. To gain further insights into the impact of collinear- 
ity in this situation, we repeated the simulation lead- 
ing to Figure 1(b) with F = (1, . . . , l)^/^To. Now, 
Pjk ^ 1 as uy oo for all pairs of predictors. How- 
ever, within the simulation error, these new results 
were identical to those of Figure 1 (b) . This suggests 
that the value of principal component estimators 
does not rest solely with the presence of collinearity. 
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5.2 Prediction 

While this article is focused on estimation of re- 
ductive subspaces, this first simulation study pro- 
vides a convenient place to touch base with predic- 
tive considerations. Since the forward regression fol- 
lows a normal linear model, we characterize predic- 
tive performance by using the scaled mean squared 
error 

(9) MSE = E(yj-a-&f^X/)V<Ty|x> 

where (Y/-,Xj) represents a future observation on 
(y, X), and the expectation is taken over (Yy,Xj) 
and the data. The forward model error variance <7y|x 
was used for scaling because the numerator of (9) is 
in the units of Y^, and the MSE will be constant for 
OLS as we vary ay. f can be 7i (PC), 0i (PFC) 
or fi (OLS). The intercept a and slope b were then 

computed from the OLS fit of Yi on F Xj (6 = 1 for 
fi). The MSE, which is bounded below by 1, was es- 
timated by first calculating the expectation explic- 
itly over the future observations and then using 500 
simulated data sets to estimate the remaining expec- 
tation over the data. Figure 1(d) shows the resulting 
MSE as a function of ay and the three estimators. 
The PFC estimator performed the best, except for 
small values of ay, where the PC estimator did bet- 
ter. The OLS estimator was dominated by the other 
two. 

6. INVERSE MODELS WITH STRUCTURED 
ERRORS 

Let To G RP^(P~^) denote a completion of T; that 
is, (ro,r) e is an orthogonal matrix. The PC 
model (2) and the PFC model (5) have the prop- 
erty that Y is independent of Tq X both marginally, 
yjir^X, and conditionally, y _IL Tj^XlT^X. Con- 
sequently, (y, r^X) _IL Tq X. This enables us to iden- 
tify Tq X unambiguously as irrelevant information 
to be excluded (Fisher, 1922). The PC and PFC 
models also have quite restrictive conditions on 
Var(Xj^) = a'^Ip. In this section we extend the vari- 
ance function of these models while preserving the 
form of the relevant and irrelevant information. 

6.1 An Extended PC Model 

Consider the PC model with heterogeneous errors, 

(10) Xy = tM + Tuy + Tonoeo + Tfte, 



where fi, T, Fq and fy are as defined previously. 
The error vectors £o £ MP~'^ and £ G M!^ are inde- 
pendent and normally distributed, each with mean 
and identity covariance matrix. The full-rank ma- 
trices n G M^^'' and flo G R(p-d)xip-d) serve in part 
to convert the normal errors to appropriate scales 
and, without loss of generality, are assumed to be 
symmetric. Since the two error components must be 
independent, the variance structure in this model 
is still restrictive, but it is considerably more per- 
missive than the variance structure in the first PC 
model (2) and the components of X can now be in 
different scales. 

Linearly transforming both sides of model (10), we 
have F^Xj^ = F^/x + Vy + fts and V'^^y = F^/x + 
rio^o- From these representations we see that 
(F,Fo)"^Xy contains two independent components. 
The active y-dependent component F"^Xy consists 
of the coordinates of the projection of X onto Sy- 
It has an arbitrary mean and constant but a general 
covariance matrix. The other projective component 
lives in the orthogonal complement span(Fo) = 5p 
and has constant mean and variance. Most impor- 
tantly, this extended model preserves the indepen- 
dence property that {Y, F"^X) _IL F^X, so F^^X is a 
sufficient reduction: 

Proposition 3. Under the extended PC model 

(10) , the distribution o/y|X is the same as the dis- 
tribution o/y|F^X for all values of^- 

Turning to estimation, we assume that the y's are 
distinct. Replication for model (10) can be addressed 
with a version of model (13), which is discussed in 
Section 6.2, by using the slice basis function for fy 
to indicate identical y's. Maximizing over fi, Uy and 
fio, the partially maximized log likelihood Lpc is a 
function of possible values G and M for F and fi^, 
apart from constants: 

Lpc(G,M) 

= -(n/2)log(|G^SGo|) - (n/2) log(|M|), 

where Go is a completion of G. fi^ is not estimable, 
but because the likelihood factors, it is maximized 
over Go for any fixed M by any full-rank linear 
transformation of the last p — d sample PC direc- 
tions. Consequently, we might be tempted to take 

(11) 5r = span-^(7rf+i,...,7p). 

Thus a sufficient reduction is again estimated by the 
first d sample principal components computed from 
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S. Nevertheless, additional conditions are necessary 
for principal components to work well under this 
model. 

Some of the differences between the PC model (2) 
and the extended version (10) can be seen in the 
marginal covariances of the predictors. Under model 
(2) 

S = a^ror^ + T{a^Id + Var(i.y )}r^. 

Here the smallest eigenvalue of 5] is equal to o"^ 
with multiplicity p — d and corresponding eigenvec- 
tors Tq. The largest d eigenvalues, which are all 
larger than cr^, are the same as the eigenvalues of 
(T^/(i + Var(i/y). The corresponding PC directions 
are Tyj, j = 1, . . . ,d, where {vj} are the eigenvec- 
tors of a'^Id + Var(i>'y). In this case Sy and are 
distinguished by the magnitudes of the correspond- 
ing eigenvalues of SI, providing the likelihood infor- 
mation to identify Sy- 
Under model (10) 

I] = TonlTl + V{n^ + Var(i/y )}r^ 

= roVoDoV^r;^^ + rvDv'^r^, 

where VoDoVq" and VDV-^ are the spectral decom- 
positions of a'^d -I- Var(i/y). The PC direc- 
tions under model (10) can be written unordered as 
FoVq and TV with eigenvalues given by the corre- 
sponding elements of the diagonal matrices Dq and 
D. The estimate of 5r given in (11) will be consis- 
tent if the largest eigenvalue in Dq is smaller than 
the smallest eigenvalue in D. In other words, the 
likelihood-based estimator (11) should be reasonable 
if the signal as represented by D is larger than the 
noise as represented by Dq. 
To help fix ideas, consider the extension of model (7) 

(12) ^y = Ty + aoro£o + fTTe, 

where (1^, X) is normally distributed, and the suffi- 
cient reduction still has dimension 1. The forward 
regression y|X follows a normal linear regression 
model where the mean function depends on X only 
via r^X, and 

5] = alv^vl + (4 + c72)rr^. 

If (Jq < CTy + CT^, then the first population princi- 
pal component yields a sufficient reduction F-^X. If 
(Tq > cTy -|- 0"^ , then a sufficient reduction is given by 
the last population principal component, illustrating 
the possibility suggested by Cox (1968). However, if 
ctq = (Ty -|- 0"^, then principal components will fail 
since all of the eigenvalues of S are equal. 



In short, principal components may yield reason- 
able reductions if the signal dominates the noise in 
the extended PC model (10). 

6.2 An Extended PFC Model 

In this section we consider a PFC model with het- 
erogeneous errors by modeling the i/-parameters in 
(10), 

(13) Xj, = /X + F/3fy + Fofloeo + rfie, 

where all terms are as defined previously, except we 
no longer require that d < r. The maximum like- 
lihood estimators under (13) are not affected by 
the presence of replication in the y's. Extensions of 
the exponential family model (3) to permit depen- 
dence among the predictors may depend on the par- 
ticular family involved. For instance, the quadratic 
exponential model described by Zhao and Prentice 
(1990) might be useful when the predictors are cor- 
related binary variables. 

Under model (13), F'^X is again a sufficient re- 
duction, so we are still interested in estimating the 
reductive subspace Sr- Maximizing the log likeli- 
hood over fi and (3, we find the partially maximized 
log likelihood, apart from unimportant constants, 

-(n/2)log|Mo| - (n/2)trace[M(7^G^X^XGo/n] 

-(n/2)log|M|, 
-(n/2)trace[M~^G'^{X^X-X'^PFX}G/?i], 

where G, M and Mq represent possible values for 
F, ri^ and CIq. After maximizing over M and Mq 
(see, e.g., Muirhead, 1982, page 84), the maximum 
likelihood estimate of Sr is found by maximizing 

Lppc(G) = (-n/2)log|G;^SGo| 

(14) 

-(n/2)log|G^SresG|, 

where Sres = S — X)fit is the sample covariance ma- 
trix of the residuals from the fit of Xj, on fy. Like 
previous likelihoods, (14) depends only on Sg- But 
in contrast to the previous likelihoods, here there 
does not seem to be a recognizable estimate for 5r, 
and thus Sr must be obtained by maximizing (14) 
numerically over the Grassmann manifold Gpxd- 

The following proposition gives population ver- 
sions of matrices in the partially maximized like- 
lihood function (14). It summarizes some of the dis- 
cussion in Section 6.1, and provides results that will 
be useful for studying (14). 
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Proposition 4. Assume the extended PFC mo- 
del (13) with uncorrelated hut not necessarily normal 
errors, 'Var{{£Q ,e^)'^) = Ip. Then 

S ^ 5] = rofioro + T{ft^ + /3Var(fy)/3^}r^, 
Sfit ^ Sfit = Tp\av{fY)pV, 

^res ^res = ^O^Q^Q + rf^^T"^- 

To understand the behavior of the function LpFc(G) 

(14) in a bit more detail, write it in the form, with 
H = PfXG/V^, 

-2Lppc(G)/n 

= log IG^SGol + log |G^SG - GVPfXG/uI 

= log{|G^SGo||G^SG||/„-H(G^SG)-^H^|} 

= log{|G^SGo||G^SG||/„-PF^\G^'F|}. 
The first product in the log is such that 

(15) |G^SGo||G^SG| > 

and it achieves its lower bound when the columns of 
G are any d sample PC directions and the columns 
of Go consist of the remaining directions. Conse- 
quently, the function - log{|Gj^i:Go| IG^SG]} has 
at least (2) local maxima of equal height. It is then 
up to the last term — log|/n — -Pf^g^^I to re- 
shape -Lppc into a possibly multimodal surface over 
Qpxd with a single global maximum. If a G can 
be found so that span(XG) C span(F), then |/„ — 
-Pf-Pxg-PfI = and the log likelihood is infinite at 
its maximum. If the signal is weak, or in the extreme 
F'^X ^ 0, then |4 - PfPxg-Pf| ~ 1 for ah G, this 
term will contribute little to the log likelihood, and 
we will be left with a surface having perhaps many 
local maxima of similar heights. 

Because the likelihood surface may be multimodal, 
use of standard gradient optimization methods may 
be problematic in some regressions. Consideration 
of a candidate set of solutions or starting values 
might mitigate the problems resulting from a multi- 
modal surface and can further illuminate the role of 
principal components. It also provides a relatively 
straightforward way to explore this methodology if 
computer code for Grassmann optimization is not 
conveniently available. 

Candidate sets can be constructed using the fol- 
lowing rationale. Recall from the discussion in Sec- 
tion 6.1 that the unordered PC directions are of the 
form (roVo,rV), and they comprise one possible 
population candidate set. This structure means that 



there is a subset of d PC directions 7(i), • • • ,7((i), 
such that 

St = span(r V) = span{7(i) ,...,7(^)1. 

Consequently, provided that the eigenvalues corre- 
sponding to rV are distinct from those correspond- 
ing to FoVo, we can construct an estimator of the 
sufficient reduction V-^F^X by evaluating LpFc(G) 
at all subsets of d sample PC directions 7^]^) , . . . , 7(^-) , 
and then choosing the subset with the highest likeli- 
hood. We call this the PFCpc method, because the 
PFC likelihood is being evaluated at the PC direc- 
tions. This method might be awkward to implement 
if the number of combinations to be evaluated is 
large. As an alternative, we could choose PC direc- 
tions sequentially: 

1. Find 7(]^) = argmaxLppc(h), where the maximum 
is taken over the p x 1 vector h in the PC candi- 
date set A = {7j,i = 1, . . . 

2. Find 7^2) = argmaxLppc(7(i)5 h), where the max- 
imum is now taken over the p x 1 vector h in the 
reduced candidate set A — {7(1)}. 

3. Continue until reaching the final maximization 

7(d) = arg max Lppc (7(1) 7(^.1) , h) , 

where the maximum is now taken over h in ^ — 
{7(1)1 • • • i7(d-i)}- 

This approach is similar in spirit to other method- 
ology for selecting principal components (Jolliffe, 
2002), but here we base the selection on a likelihood. 

Using Proposition 4 and following the same ratio- 
nale we can construct two additional candidate sets. 
One consists of the PFC directions, the eigenvectors 
of Sfit) and the other contains the eigenvectors of 
Srcsi which are called the residual component {RC) 
directions. The estimator constructed by evaluating 
the PFC likelihood (14) at all subsets of the candi- 
date set consisting of the PC, PFC and RC direc- 
tions will be denoted as PFCaii. 

Before turning again to illustrative simulations, 
we connect principal fitted components with sliced 
inverse regression. 

6.3 PFC and SIR 

To relate SIR and PFC, fj, must be constructed us- 
ing the slice basis function (4). Let = J2ieHk '^i/'^^k 
and X = X]r=i ^j/'T-- Then it can be shown that Sgt 
is the sample covariance matrix of the slice means 
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h 

Sfit = E (^fc/^) - X) (Xfc - x)^ 

k=l 

^1/2- -1/2 

— 1/2— 1/2 

where Ssir = S Sgt^] is the usual SIR kernel 
matrix in the standardized scale of Z = S~^/^(X — 

^_l/2 

E (X) ) , with sample version Z = 5] (X — X) . Sub- 
stituting this form into the partially maximized log 
likelihood (14), we have 

-2Lppc(G)/n = log|G^SGo| 

+ log|G^s'/'{/p-Ssir}s'/'G|. 

Suppose now that we redefine (Gq, G) to be an or- 
thogonal matrix in the S inner product, (Gq, G)"^ x 
S(Go, G) = Ip. Then G^SGq = Ip-d and, letting 

— 1/2 

G* = 1] G, (14) can be written as a function of 
G* with G*^G* = Id, 

Lppc(G*) = -(n/2)log |G*'^{/p - £sir}G*|. 

This objective function results in the standard SIR 
estimates since it is maximized by setting the columns 
of G* to be the matrix G* whose columns are the 
first d eigenvectors of Xlsir- This solution is then 
back-transformed to the original X scale so that 

^ 1/2 

5r = 5] span(G*). From this we see that, start- 
ing from the normal likelihood for (13), the SIR so- 
lution is found by using a sample-based inner prod- 
uct. Relative to the likelihood, this can have a costly 
effect of neglecting the information in S when esti- 
mating the sufficient reduction. 

While SIR does not require normality or a partic- 
ular structure for Var(Xy), it is known that its oper- 
ating characteristics can vary widely depending on 
these features. Bura and Cook (2001) argued that 
generally SIR performs the best under normality 
and that its performance can degrade when Var(Xy) 
is not constant. Cook and Ni (2005) showed that 
SIR can be very inefficient when Var(X.y) is not con- 
stant and they provided a new model-free method 
called inverse regression estimation (IRE) that can 
dominate SIR in applications. From these and other 
articles, we would expect SIR to be at its best un- 
der the models considered here, which all have both 
normality and constant Var(Xy). 

Like SIR, the fundamental population character- 
istics of the PC and PFC methods considered here 



do not hinge on normality. From (14) and Proposi- 
tion 4, the normalized partially maximized log like- 
lihood LpFc(G)/n converges in probability to 

Lppc(G) = -(l/2)log|G;[SGo| 

-(l/2)log|G^I],esG|. 

We then have: 

Proposition 5. Assume the conditions of Propo- 
sition 4 ■ Then T = arg maxc LppQ (G) . 

This proposition says that the likelihood objec- 
tive function arising from the extended PFC model 
(13) produces Fisher consistent estimates when the 
errors are uncorrelated, but not necessarily normal, 
suggesting that normality per se is not crucial for 
the type of analysis suggested in this article. 

6.4 Illustration via Simulation 

A simulation study was conducted to illustrate 
some of the results to this point. We generated Y 
as a A^(0, ay) random variable, and then generated 
Xy according to model (12) with p = 10, sample 
size n = 250 and F = (1,0,..., 0)"^. As in previous 
simulations, dim(5r) = 1 to allow straightforward 
comparisons with OLS. In reference to model (13), 
the data were generated with CIq = a^Ip-i and Q, = 
a. Four estimators were applied to each data set: 
OLS, SIR with eight slices and PFCpc based on the 
likelihood for the extended PFC model (13) using 
the slicing construction (4) with eight slices for fy. 
To expand the comparisons, we also included a re- 
cent semiparametric estimator RMAVE (Xia et al., 
2002), which is based on local linear smoothing of 
the forward regression with adaptive weights and, 
like SIR, is expected to do well in the context of this 
simulation. The results were summarized by com- 
puting the angle between each of the four estimates 
and Sr- 

The angles plotted in Figure 2 are averages taken 
over 500 replications. Figure 2(a) is a plot of the av- 
erage angle versus the error standard deviation a for 
the signal, with ay = o"o = 1 . Clearly, the likelihood- 
based estimator PFCpc dominates SIR, OLS and 
RMAVE, except when a is close to 1, so the three 
variances in the simulation are close. The RMAVE 
estimator is indistinguishable from OLS in all of the 
simulations of Figure 2. In Figure 2(b) ay was varied 
while holding cj = ctq = 1 ■ Again we see that PFCpc 
dominates over most of the plot. 
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The error standard deviation do for the inactive 
predictors was varied for the construction of Fig- 
ure 2(c). Here the results are of a fundamentally 
different character. PFCpc performed the best for 
do < 1, and the three methods are roughly equiva- 
lent for the larger values of ctq. But in the middle 
region, PFCpc performed the worst. This poor per- 
formance arises because the estimate was computed 
using the PC candidate set, and in the simulations 
when (To = \/2 the population eigenvalues of Xl = 2Ip 
are equal. When 5] is spherical, the principal com- 
ponents are arbitrary and cannot be expected to 
convey any useful information. These observations 
allow us to guess about the qualitative behavior of 
the four estimators in Figure 2(a) when a < 1 and in 
Figure 2(b) when ay <1. For instance, as ay goes to 
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in Figure 2(b), all methods should exhibit deteri- 
orating performance, but PFCpc should again per- 
form the worst since the eigenvalues of I] converge to 
1. These guesses are confirmed by simulations (not 
shown) . 

Figure 2(d) was constructed as Figure 2(c), except 
the estimator PFCpc was replaced by the estimator 
PFCaii based on the candidate set of PC, PFC and 
RC directions. The PFCaii estimate of Sy always 
performed as well as or better than SIR, OLS and 
RMAVE, except in a neighborhood around ctq = \/2. 
The vector that maximized the likelihood always 
came from the PC candidate set for ao < 0.75, from 
the PFC candidate set for 1 <aQ < 1.25, and from 
the residual candidate set for ao >2. At ctq = 1.5 
this vector came about equally from the residual and 
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(b) Angle vs 17 v , with aa = t =1 
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(d) Angle vs uo. (t = ay =1 



Fig. 2. Simulation results based on model (12). In (a)-(c) the likelihood-based estimator was computed from the PC candidate 
set. In (d) the estimator was computed from the full candidate set containing the PC, PFC and RC directions. RMV is short 
for RMAVE. 
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PFC candidate sets. These results may provide some 
intuition into operational characteristics of the like- 
lihood as reflected through the three matrices given 
in Proposition 4. For this simulation example those 
matrices are 

Sfit = rr ay, 

Srcs = roFo (To + rr'^fj^. 

When (To is small, the first sample PC direction ev- 
idently provides the best estimate of Sr- When ao 
is large, we can gain information on Sy from the 
smallest PC or the largest PFC. Evidently, the er- 
ror variation that comes with ctq causes the smallest 
PC to be less reliable than the largest PFC. When 
ctq = (T^ +(7y, the PC directions provide no informa- 
tion on Sr, but the PFC and residual directions can 
both provide information. 

To gain insights about the potential advantages 
of pursuing the maximum likelihood estimator, we 
used a gradient algorithm for Grassmann manifolds 
(Edelman, Arias and Smith, 1998) to find a local 
maximum of the likelihood, starting with the best 
direction from the full candidate set. The local like- 
lihood solution resulted in improvements all along 
the "PFCaii" curve, with the greatest improvement 
in a neighborhood around (Tq = 1.5. For instance, at 
(To = 1.5, the average angle for the local likelihood 
solution was about 11 degrees, which is quite close 
to the average angles for SIR and OLS, while the 
average angle shown in Figure 2(d) for the full can- 
didate set is about 15 degrees. On balance, the local 
likelihood gave solutions that were never worse and 
were sometimes much better than those of the three 
competing methods. 

SIR and RMAVE are model-free methods of di- 
mension reduction for regression. Normally some loss 
would be expected relative to likelihood-based meth- 
ods when the model holds. But the magnitude of the 
loss in this simulation, particularly in Figures 2(a) 
and 2(b), is surprising. The behavior of SIR is likely 
explained in part by the discussion in Section 6.3. 
Some authors (see, e.g., L. Li and H. Li, 2004; and 
Chiaromonte and Martinelli, 2002) have used princi- 
pal components to reduce the dimension of the pre- 
dictor vector prior to a second round of reduction 
using SIR. The results of this simulation raise ques- 
tions regarding such methodology generally. If we 
are in a situation like Figure 2(a) or 2(b) where prin- 
cipal components do well, the motivation for switch- 
ing to SIR seems unclear. On the other hand, if we 



are in a situation like Figure 2(c) with ao ~ \/2, 
then there seems little justification for using princi- 
pal components in the first place. 

6.5 Model Selection 

The dimension d of the reductive subspace was 
assumed known in the discussion of the extended 
PFC model (13), but inference on d, which is in 
effect a model selection parameter, may be required 
in practice. Likelihood methods are a natural first 
choice, including penalized log likelihood methods 
like AIC and BIC. In this section we briefly consider 
one possibility for inference on d, and include two 
simple illustrative examples. 

If we set d = p, then we can take T = Ip and the 
extended PFC model (13) reduces to the standard 
multivariate normal linear model, Xy = fj, + (3{y + 
fte, which we call the "full model." All extended 
PFC models with d < p are properly nested within 
the full model and may be tested against it by us- 
ing a likelihood ratio. Let A^ denote —2 times the 
log likelihood ratio for comparing an extended PFC 
model to the full model. The dimension of the Grass- 
mann manifold Gpxd is d{p — d), which is the num- 
ber of real parameters needed to determine Sr and 
to simultaneously determine 5p. From this it can 
be verified that, under the PFC model of the null 
hypothesis, A^ is asymptotically chi-squared with 
r{p — d) degrees of freedom. 

We use the Horse Mussel data (Cook and Weis- 
berg, 1994) for the first example. The response is 
the logarithm of muscle mass, and the p = 4 pre- 
dictors are the logarithms of shell height, length, 
mass and width. Scatterplots of the predictor ver- 
sus the response indicate that = y — y is a rea- 
sonable choice. The extended PFC model (13) with 
d = 1 was fitted by using Grassmann optimization 
with the starting value chosen as the best direction 
from the full candidate set. This gave Ai = 3.3 with 
three degrees of freedom, indicating that these data 
provide little information to distinguish between the 
full model and the PFC model with d= 1. The pair- 
wise sample correlations between the estimated suf- 
ficient reduction, the first PC and the first PFC were 
all essentially 1, so the first PC produces an equiva- 
lent solution for these data. However, PC regression 
by itself does not allow one to infer strong indepen- 
dence, (y, r-^X) ILFq X. Using the cubic option for 
fy produced the same conclusions. 

Fearn's (1983; see also Cook, 1998, page 175) cal- 
ibration data are the basis for the second example. 
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The response is the protein content of a sample of 
ground wheat, and the predictors are — log(reflectano 
of NIR radiation at p = 6 wavelengths. The predic- 
tors are highly correlated in these data, with pair- 
wise sample correlations ranging between 0.92 and 
0.9991. As in the previous example, fy = y — y seemed 
to be a reasonable choice. Fitting the extended PCF 
model with this fy gave Ai =29.1 with five degrees 
of freedom and A2 = 2.6 with four degrees of free- 
dom. Consequently, we infer that the sufficient re- 
duction is composed of two linear combinations of 
the predictors, which can be viewed in a three-dimen- 
sional plot with the response. In contrast to the pre- 
vious example, here there does not seem to be an 
easily described relationship between the estimated 
sufficient reduction and the principal components. 
All of the principal components are related to the 
sufficient reduction in varying degrees, the strongest 
relationships involving the second, third and sixth 
components. 

Several data sets from the literature were studied 
similarly. The conclusion that d <p, and sometimes 
substantially so, was the rule rather than the ex- 
ception. In the next section we consider more gen- 
eral versions of the PC and PFC models by allowing 
for unstructured errors. This will provide a closer 
connection with some standard methodology, and 
may give intuition about the common practice of 
standardizing the data prior to computing princi- 
pal components. We concluded from this that the 
kinds of models proposed here will likely have wide 
applicability in practice. 

7. PC AND PFC MODELS WITH 
UNSTRUCTURED ERRORS 

Suppose that follows the general PC model 
(16) Xy = ti + rUy + aA^/^e, 

where the parameters and the error e have the same 
structure as in model (2) and the conditional covari- 
ance matrix Var(Xj/) = o"^ A > 0. Then we have the 
following. 

Proposition 6. Under model (16), the distri- 
bution 0/ y|X is the same as the distribution of 
y|r^A-^X for all values of X. 

We first consider implications of this proposition 
when A is known, and then turn to the case in which 
Var(Xy) is unknown. 



7.1 A known 

I 

The essential variance condition in the PC model 
(2) and the PFC model (5) is that Var(X2^) = A, 
where A is known but not necessarily the identity. 
According to Proposition 6 a sufficient reduction for 

(16) is r'^A'^X. Letting Zy = A-^/^X^, we have 

Zy = A~i/V + A~^^^Tuy + ae 
= /X* + r V* + ere, 

where the columns of F* are an orthonormal ba- 
sis for span(A~^/^r) and u* is the corresponding 

coordinate function. It follows that 5r = A^/^5r* 
and thus that the coordinates of the sufficient re- 
duction are in 

A^^cSr = A^i/^cSr*. In short, the 
required reductive subspace is estimated by 

the span of A~^/^ times the first d eigenvectors of 
A~^/^SA~"'^/^. An implication of this result is that 
principal components computed in the standardized 
Z-scale are appropriate reductions for both X and 
Z, because 

r^z = (r*^A-i/2)(Ai/2z) = (A-i/2r*)^x. 

Turning to the general PFC model 

(17) Xy = fi + Tp{y + aA^/^s, 

and following the discussion of model (16), r"^A~^X 
is again a sufficient reduction. The maximum like- 
lihood estimate of the reductive subspace is the span 

of A ' times the first d eigenvalues of 
A-i/2SfitA-V2. 

7.2 Var(Xy) unknown 

We now turn to the PFC model (17) with Var(Xy) 
unknown. For notational convenience, redefine A = 
Var(Xy), absorbing the scale parameter o"^ into the 
definition of A. Maximizing the log likelihood over 
P and fi, the resulting partially maximized log like- 
lihood 

(-n/2)log|D| 

(18) - (n/2)trace[D^i/2SD"^/2 

- PD-i/2GD"'/'£fitD-i/2] 

is a function of possible values D and G for A 
and r. For fixed D this function is maximized by 
choosing D^^/^Q i^g ^ basis for the span of the 
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first d eigenvectors of D ^/^^stD yielding an- 
other partially maximized log likelihood K(Y)), 

K(D)/n = (-l/2)log|D| 

-(l/2)|trace[D-i/2sD-V2] 

-^A.(D-V2Sfi,D-V2)| 

i=l ) 

= (-l/2)log|D| 

- (l/2)|trace[D-iErcs] 

+ A,(D-iEfit)|, 

i=d+l ) 

where Ai(A) indicates the ith eigenvalue of A, and 
the second equality was found by substituting S = 
Sfit + Sfcs- The first two terms alone, 

(-l/2)log|D| - (1/2) trace [D-iS,cs], 

are maximized by A = Xlres and this is a consistent 
estimator of A. The final term, 

-(1/2) Ai(D-iSfit), 

i=d+l 

reflects the fact that we may have r > d and thus 
that there is an error component in Sgt due to over- 
fitting. Since it is assumed that rank(5]fit) > d, this 
term will not be present if r = d, and use of A = Sres 
should be reasonable if r is not much larger than d. 
However, the final term may be important if r is 
substantially larger than d. 

Once A is determined, the estimate of the reduc- 
tive subspace is the span of A~^/^ times the first d 
eigenvectors of A~^/^Sfit A"'^/^. This is the same as 
the estimator in the case where A is known, except 
A is substituted for A. 

7.3 Prior Data Standardization for PC's 

Reduction by principal components is often based 
on the marginal correlation matrix of the predictors 
rather than on Y, (see, e.g., Jolliffe, 2002, page 169; 
L. Li and H. Li, 2004). In this section we provide a 
population-level discussion that may shed light on 
the appropriateness of this practice. 

The contours of the conditional covariance matrix 
are spherical in the first PC model (2), Var(Xy) = 
(T^/p, and the contours of Xl = a^Ip + rVar(i>'y)r"^ 



are elliptical. Generally, the method of maximum 
likelihood treats Var(Xy) as a reference point, with 
the impact of the response being embodied in the 
eigenvectors of 5] relative to Var(Xy). In the context 
of model (2), these eigenvectors are the same as the 
eigenvectors of 5]. When passing from Var(Xy) = 
fj^/p to S, the response distorts the conditional vari- 
ance by "pulling" its spherical contours parallel to 
the reductive subspace, which is then spanned by 
the first d principal components. The same ideas 
work for the first PFC model (5), except more in- 
formation is supplied to the mean function E(Xj^), 
with the consequence that the marginal covariance 
matrix of the predictors I] is replaced by the covari- 
ance matrix of the fitted values Xlgt. 

Likelihood estimation in all of the other normal 
models considered in this article can be interpreted 
similarly, but the calculations become more involved 
because the reference point Var(Xy) is no longer 
spherical. Consider first the general PC model (16) 
with A known. The reference point Var(Xjy) = o"^ A 
no longer has spherical contours, so the eigenstruc- 
ture of S is not sufficient to find the reductive sub- 
space. To find the eigenvectors of S relative to 
Var(X2^), first construct the standardized predictors 
Z = A^^/'^X. The reductive subspace in the Z-scale 
is then the span of the first d eigenvectors of Var(Z) = 
A~^/^SA~^/^, because the contours of Var(Zy) = 
cj^/p are spherical. The final step is to return to the 
X-scale by multiplying these eigenvectors by A^^^^. 
The general PFC model (17) follows the same pat- 
tern, as may be seen from the discussion at the end 
of Section 7.2. 

It may now be clear why the approach in this arti- 
cle provides little support for the common practice 
of standardizing the predictors so that the covari- 
ance matrix of the new predictors W = diag(S)^^/^X 
is a correlation matrix. The discussion of the PC 
model in Section 7.1 indicated that standardization 
should be based on A, not diag(5]), followed by back 
transformation to the original scale. Reduction using 
the eigenvectors of Var(W) requires that Var(Wy) 
be spherical, but this requirement will not gener- 
ally be met. As a consequence, the eigenvectors of 
Var(W) may have little useful relation to the reduc- 
tive subspace. 

As a simple example, consider the simulation mo- 
del (7), where the predictors are marginally and con- 
ditionally independent, with Var(Xy) = cr^/p. Con- 
sequently, Var(W) = Ip and the principal compo- 
nents of the standardized predictors do not convey 
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useful information. However, Var(Wj^) is not spher- 
ical, so the sufficient reduction should be computed 
from the eigenvectors of Var(W) = Ip relative to 
Var(Wj^). 

7.4 Connection with OLS for Y — X 

Suppose that we adopt model (17) with iy = {y — 
y) and A unknown, again absorbing into A. In 
this case, (i = r = l,/3isa scalar and T G M^. Letting 
C = Cov(X,y), 

^ CC^ 



which is of rank 1. Consequently, it follows from the 
previous section that 

_^ CC^ 

— ^rcs — S j-^ — 

and that 

The estimate is the span of Sres''^ times the 

eigenvector corresponding to the largest eigenvalue 

^—1/2^ ^—1/2 ^ 

of 5]rcs SfitSrcs • But Sfit has rank 1, and con- 
sequently the first (nonnormalized) eigenvector is 

^—1/2^ ^—1/2 

Sires C. Multiplying this by Sros , we have that 
cS^-ip = span(£~gC). Next, Sl^gC = fl~^C x con- 
stant, so the inverse method yields OLS under mo- 
del (17) with fy = {y — y): = span(S). The 

constant above is &i:'^C/{a^ - C'^'S^^C). This 
connection with OLS requires that d = r = 1 and 
that iy = (y — y). Estimators based on (17) and other 
values of d and r may thus be regarded as a subspace 
generalization of OLS. 

The relation = span(S:) that holds under 

model (17) allows us to reinterpret the results for 
OLS, PC and PFC in Figures 1 and 2 as a compari- 
son between three estimators based on inverse mod- 
els with different structures for E(Xj^) and Var(Xj^). 
The relative performance of the PC (2) and PFC 
(5) estimators in Figure 1 suggests that substan- 
tial gains are possible by modeling the inverse mean 
function. On the other hand, the relative perfor- 
mance of OLS (17) and the extended PFC estimator 
(13) in Figure 2 indicates that there are substantial 
costs associated with estimating A. These conclu- 
sions point to the extended PFC model (13) as a 
particularly useful target for dimension reduction in 
practice, since it requires a model for the inverse 
mean and avoids estimation of A when appropri- 
ate. 



7.5 Connection with SIR 

As in Section 6.3, we must use the slice basis func- 
tion (4) to relate the SIR estimator of the reduc- 
tive subspace to the estimator obtained from 
model (17) using A — ^res cLS the estimator of A. 
Letting ii denote an eigenvector of the normalized 
SIR matrix (cf. Section 6.3) 

5^-i/2|^ is a non-normalized eigenvector of S Sgt 
with the same eigenvalues as Sgir- The subspace 
spanned by the first d eigenvectors of S Sgt gives 
the SIR estimate of Similarly, the subspace 

spanned by the first d eigenvectors of 'Sj-^s ^fit is the 
estimate of the reductive subspace from model (17), 
still using A — Sj-es* These estimators are identi- 
cal provided ^rcs > 0, because then the eigenvectors 

of 5] Xlfit and I]j,gg5]fit are identical, with corre- 
sponding eigenvalues Aj and Aj/(1 — Aj) (see Ap- 
pendix A. 7). 

7.6 Simulations with Unstructured Errors 

To help fix ideas and provide results to direct fur- 
ther discussion, consider data simulated from the 
model 

(19) Xj, = Fy + A^/^e, 

where p = 10, y is a normal random variable with 
mean and standard deviation ay = 15, r = (1,..., 
1)^/^, and A was generated once as A = A A, 
where A is a p x p matrix of independent stan- 
dard normal random variables, yielding predictor 
variances of about 10 and correlations ranging be- 
tween 0.75 and —0.67. Four estimators of the suffi- 
cient reduction subspace were computed for 
each data set generated in this way: 

1. The OLS estimator. This is the same as the PFC 
estimator with fy = {y — y) (cf. Section 7.4). 

2. The PFC estimator with a third-degree polyno- 
mial in y for iy, designated PFC-Poly in later 
plots. 

3. The SIR estimator with eight slices. This is the 
same as the PFC estimator with the slicing op- 
tion for iy and A = Sres (cf. Section 7.5). 

4. The PFC estimator using the slicing construction 
for iy with eight slices and the true A, designated 
PFC- A in later plots (cf. Section 7.2). 
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(a) logg( Angle) versus n 

Fig. 3. Natural logarithm of the average angle versus (a) , 
model (19) with different covariance matrices A. 

Shown in Figure 3(a) are the natural logarithms of 
the average angles from 500 replications of each sam- 
pling configuration. The logarithms were necessary 
since the angles varied over several orders of mag- 
nitude. The OLS and PFC-Poly estimators were es- 
sentially indistinguishable and are represented by a 
single curve in Figure 3(a). These estimators seemed 
to do quite well, with the average angle varying be- 
tween 0.51 for n = 50 and 0.18 for n = 250. The 
performance of the PFC estimator using the true A 
was exceptional, the average angle varying between 
0.0034 and 0.0014. SIR performed the worst, its av- 
erage angle varying between 19.2 and 8 degrees. 

PFC did exceptionally well when using the true 
A, while sir's performance was considerably worse. 
The reason for this difference seems to be that SIR's 
slicing estimate of A is biased. In the context of 
this example, A = 5] — rr-^dy . But when using the 
slicing construction for fy , A = 'Si^^s is a consistent 
estimator of 

Asir = S-Var(E(X|yGFfc)) 

= s - rr^Var(E(y|y g Hk)), 

where indicates slice k, as defined in Section 4. 
The difference between these two population covari- 
ance matrices is 

A - Asir = rr^(4 - Var(E(y|y G Hk))) 
= rr^E(Var(y|y GFfc)), 

which is nonzero for continuous responses. One con- 
sequence of this bias is that SIR may not be able to 
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\mple size and (b) k for four estimators based on simulation 



find an exact or near exact fit. An exact fit occurs 
in the context of model (17) if A = 0, so T^X is 
a deterministic function of y. To illustrate this phe- 
nomenon we generated data from simulation model 
(19) with 

A = (c/io - rr^)4 

and c > 1. Here S = cay ho and an exact fit occurs 
if c = 1. Values of c < 1 are not allowed since then A 
will not be positive definite. With c = 1 + O.l/lO'^, 
Figure 3(b) shows the natural logarithm of the av- 
erage simulation angle versus k for the four estima- 
tors used in Figure 3(a) with n = 50. Clearly, SIR's 
response to increasing k is negligible. At k = 4 the 
average angles for SIR, OLS and PFC-A were about 
7.7, 0.085 and 0.0002, respectively. 

The general conclusions here are that (1) there can 
be a substantial cost associated with estimation of 
A (cf. Section 7.4) and (2) the slicing construction 
for fy may impose inherent limitations on the anal- 
ysis under model (17). The first conclusion does not 
occur for any of the other inverse models discussed 
in this article, because for them Sr is the reductive 
subspace, which does not require a direct estimate 
of A. 

8. DISCUSSION 

8.1 General Remarks 

Conditioning. Fisher believed that if a statistic is 
ancillary, then inferences should be made from the 
conditional distribution of the data given that statis- 
tic. As a consequence of this logic, many of us have 
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been taught and still practice what has become effec- 
tively a first principle of parametric regression: In- 
ference should be conditioned on the observed values 
of the predictors, even if Y and X are both random. 
It may be, however, that this principle has forced a 
myopic view of regression methodology. The inverse 
models studied in the previous sections describe the 
conditional distribution of X given Y and thereby 
make explicit use of the randomness in X. With 
these inverse models, we were able to achieve re- 
sults that are superior to those from standard meth- 
ods, and to those from recent dimension reduction 
methods. The success of these inverse models de- 
pends in part on imposing an appropriately restric- 
tive structure on the conditional variances Var(Xj^). 
They were presented, not as models for shotgun ap- 
plication, but as illustrations of potential, recalling 
Fisher's view that model specification is a matter for 
the practical statistician. In contrast, if we condition 
on the observed values of the predictors, they be- 
come known constants and the possibilities of infer- 
ring about their variance structure or utilizing prior 
knowledge are lost to us. 

Reductive subspaces. The reductive subspace pro- 
vides a connection between forward and inverse re- 
gressions. Starting with the PC model (2) and pass- 
ing through a series of extensions, the last two PFC 
models considered were the extended PFC model (13), 
Xj, = fJ, + r/3fy + ToflQEQ + Tfte, and the general 
PFC model (17), Xy = + Fpfy + A^/^^, in which 
A is unstructured and unknown. This final model 
connects inverse and forward regression methodol- 
ogy, since it is here that certain forward and in- 
verse estimates of are the same. While the 
error structure in model (13) is restrictive, it may 
be useful in some applications. Perhaps more im- 
portantly, we should be aware of the possibility to 
develop models "between" (13) and (17) that allow 
us to infer simultaneously about the reductive sub- 
space and about the relevant structure of Var(Xy). 

Simulation practices. Perhaps due in part to the 
conditioning tradition, it seems quite common to 
generate X as a N{0,I) random vector in simula- 
tion studies to compare regression methods. This 
practice may place notable limitations on the re- 
sults of the simulation, since it implies that there 
is no useful information in Var(X), as in the simu- 
lations with do = \/2 in Figure 2(c). This will give 
a clear edge to some forward methods. However, as 
demonstrated in this article, when (1^, X) has a joint 



distribution there may well be useful information in 
Var(X). 

Collinearity. The rationale for employing princi- 
pal components in regression has been rather un- 
even. Tied closely to the presence of collinearity, 
reduction by principal components has been seen 
as a way to compensate for variance inflation in 
the estimates of the regression coefficients. However, 
collinearity played no essential role in this article, 
suggesting that the utility of principal component 
reduction is broader than previously seen. 

n <p. Dimension reduction seems particularly im- 
portant in regressions where "n < p." Many avail- 
able methods encounter problems at an operational 
level because of the need to compute How- 
ever, with the exception of Section 7, the methods 
described in this article do not require the computa- 
tion of an inverse, and may therefore have value in 
regressions where n is not large relative to p. First 
simulation results sustain this conjecture, particu- 
larly when the methods are used in conjunction with 
predictor screening at the outset. As mentioned at 
the end of Section 2, past studies have based predic- 
tor screening on the univariate forward regressions 
of Y on Xj. However, the results here suggest that 
predictor screening be based on univariate inverse 
regressions of Xj on fy, j = 1, . . . ,p. 

8.2 Model-Based Sufficient Dimension 
Reduction 

At the outset of his book Statistical Methods for 
Research Workers, Fisher (1941, page 1) offered the 
following definition: 

Statistics may be regarded as (i) the study 
of populations, (ii) as the study of varia- 
tion, (iii) as the study of methods of the 
reduction of data. 

In this statement and in his commentary that fol- 
lows. Fisher seems to suggest that "reduction of 
data" may encompass more than just sufficient statis- 
tics; for instance, efficient statistics may be adequate 
for such purposes. For this reason I imagine that 
Fisher would not have objected to the notion of a 
sufficient reduction as defined in this article. 

The sufficient reductions described in the previ- 
ous sections are special cases of a general reductive 
paradigm that emerges from the following definition: 
A reduction i?:]RP — > M'', q<p, is sufficient if at 
least one of the following three statements holds: 
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Reductive forms. 

(i) x|(y,i?(x))~x|i?(x), 

(ii) y|x~y|i?(x), 

(iii) y_lLX|i?(X). 

Statement (i) corresponds to inverse regression and 
requires only the conditional distribution of X|y. 
For instance, if y is a categorical response indicat- 
ing one of two populations and X|y is normal with 
mean fiy and covariance matrix A, then Fisher's 
linear discriminant function is a sufficient reduc- 
tion (Rao, 1962). Statement (ii) corresponds to for- 
ward regression and requires only the conditional 
distribution of y|X, while statement (iii) requires 
the joint distribution of (y, X). The key point for 
present purposes is that these three forms are equiv- 
alent if y and X are both random. For example, we 
may determine a sufficient reduction from X|y (i) 
and then pass that reduction to the forward regres- 
sion (ii) or the joint distribution (iii) without spec- 
ifying the marginal distribution of Y or the condi- 
tional distribution of y|X. This is how all sufficient 
reductions were determined in the previous sections, 
the methods of derivation being essentially the same 
as the methods available for determining sufficient 
statistics. 

The connection with sufficient statistics goes fur- 
ther: If we set X equal to the data, X = D, and set 
y equal to the parameters, Y = 9, then statement 
(i) becomes D\{9,R) ~ D\R and we are led back to 
Fisher's concept of sufficiency (1). In this way, the 
notion of a sufficient reduction encompasses suffi- 
cient statistics as well. 

While the reductions discussed in the previous sec- 
tions are linear functions of X, sufficient reductions 
do not have to be linear. If X^ is normally dis- 
tributed with mean and Var(Xj;) = Ip + v'yTT'^ , 
i/y > -1, r G RP, then (r'^X)^ is a sufficient reduc- 
tion. 

8.3 Model-Free SufFicient Dimension Reduction 

In this article I have adopted a largely Fisherian 
perspective: (1) Find an adequate solution to the 
problem of specification for the conditional distri- 
bution of X|y, (2) use the inverse model for X|y 
to estimate a sufficient reduction -R(X), and then, 
as described in Section 8.2, (3) pass the estimated 
reduction to the forward regression. Lacking an in- 
verse model, these ideas are not directly applicable 
because then there is no probability structure with 
which to determine a sufficient reduction. However, 



progress is still possible by restricting the search for 
a sufficient reduction to a specific functional form. 
In view of their prevalence in the previous sections, 
linear reductions form a natural and potentially use- 
ful class. 

Consider then the reductive forms of Section 8.2 
with i?(X) = G^X, G G MP^^ k < p. One linear 
reduction always exists because statements (i)-(iii) 
are trivially true with G = Ip. If G"^X is a lin- 
ear reduction, then so is A^G-^X for any full rank 
^ g ^fcxfc^ suggesting again that interests center on 
the dimension reduction {DR) subspace Sg- If is 
a DR subspace and Sg ^ S, then S is also a DR sub- 
space. There may be infinitely many DR subspaces, 
and it therefore becomes necessary to consider the 
"smallest" subspace. 

There are at least two ways to define a small- 
est DR subspace. One way is to require a subspace 
^min = minfc Sg with the smallest dimension. How- 
ever, such subspaces are not necessarily unique and, 
even if they were, they do not impose sufficient struc- 
ture on the regression for progress in theory or un- 
complicated application in practice. Another way 
is to restrict attention to the class of regressions 
in which the intersection Sy\x = H^g of DR 
subspaces is itself a DR subspace. The central sub- 
space 5y|x (Cook, 1994, 1998) then becomes the 
parameter of interest. The reductive subspaces 5r 
and encountered in previous sections are in- 

stances of model-based DR subspaces. Since gener- 
ally there is no forward or inverse model to tie up 
loose ends like high-order conditional moments, it 
has proven quite hard to estimate the entire cen- 
tral subspace without some restrictions on the re- 
gression. Nevertheless, there are successful methods 
that can provide useful estimates of Sy\x under con- 
ditions that are weak relative to a parsimonious for- 
ward or inverse model. 

An introduction to model-free sufficient dimen- 
sion reduction via central subspaces is available from 
Cook (1998) and the references contained therein. 
See Cook and Ni (2005) for recent methodology and 
references to recent literature. 

APPENDIX 
A.l Propositions 1, 3 and 6 

We first demonstrate Proposition 6. Propositions 
1 and 3 will then follow as special cases. 

To demonstrate Proposition 6 we first show that 
the distribution of X|(r^A~^X, y = y) is the same 
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as the distribution of X|r"^A~^X for all y. Accord- 
ing to model (16), Xj^ is normally distributed with 
mean fiy = fj, + Tuy and constant variance. Thus 
X|(r^A~^X, y = y) is normally distributed with 
constant variance and mean 



E(x|r^A-ix,y = y) 



fx 



-y ' - A-ir(A) 



T 



X 



P 



T 



+ (-^p -^A~^r(A)^ 



where -PA~^r(A) operator that projects onto 

span(A~^r) in the A inner product. Prom this the 
last term in the second equation is 0. Since 
X|(r^A~^X, Y = y) is normally distributed with 
mean and variance that do not depend on y, it fol- 
lows that the distribution of X|(r^A~^X, Y = y) is 
the same as the distribution of X|r^A~^X for all y. 
Consequently, 1" _lLX|r"^A~"'^X, which implies that 
y|X and y|r"^A~^X have identical distributions. 

Proposition 1 follows immediately because A = 
Ip. Under Proposition 3, 

[Var(Xj,)]-i = ro^o 'r^ + rn-^r^. 

Consequently, [Var(Xy)]"ir = Tn~^ and Proposi- 
tion 3 follows. 

A. 2 Proposition 2 

We use a different approach to demonstrate this 
proposition. Suppose that under model (3) the joint 
density or mass function /(x|y) of X|(y = y) can be 
written as 

/(x|y) = /i(x)g(r^x,y), 

where /i is a function that does not depend on y 
and 5 is a function that depends on x only through 
P-^x. It would then follow from the usual factoriza- 
tion theorem for sufficiency that the distribution of 
X|(r'^X,y = y) is the same as the distribution of 
xjp^X for all y, and thus y_lLX|r^X. 

To demonstrate the required factorization, let Xj 
be the jth element of x and, using the conditional 
independence of the predictors, write 

p 

/(x|y) = n aj{'nyj)bj{xj)ew{xjriyj] 



: W aj{r}yj)bj{xj)Qyiv{xj{iij+-(]uy)} 



\{hj{xj)eyi^{xj^ij] 
Lj=l 

p 

exp{i/^r'^x}[|aj(r/yj; 

i=i 

/i(x)g(r^x,y). 



A. 3 Equation (6) 



Substituting /i = X and = (F'^F)^IF^XG into 
the log likelihood, we need to maximize 

Mppc = (-np/2)log(s2) 

- (l/2s2)^ ||Xy - X - PGX^F(F^F)-if^||' 



(-np/2)log(s2) 
- (l/2s^)i trace 



5](X,-Xf(X,-X) 



- trace PgX^F(F^F)-^ 



y 

— trace ^(Xj^ — X) 
- y 

X fJ(F^F)-iF'^XPG 
trace (F'^F)-^F^XPgX^ 
<F(F^F)-i^f,fJ 



But Eyfy(X^ - W = and Eyf/J = F^F. 

Thus 

Mppc = (-np/2)log(s2) 

- (l/2s2){ trace [X^X] - trace [Pg^^^^fX] 
- trace [X^PfXPg] 

-h trace [X^PfXPg]} 

= (-np/2)log(s2) 
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- (l/2s^){ trace [X' X] - tracefPc^^ Pf^]} A. 6 Proposition 5 

= (-np/2)log(s2) 

- (n/2s2){trace[S] - trace [Pc^fit]} • 
A. 4 Proposition 4 

Requiring the errors to be uncorrelated but not 
necessarily normal, it is known that S S, and 

E = Var(X) 

= E(Var(X|y)) + Var(E(X|y)) 

= Toftlrl + rn^T^ + r/3Var(fy)/3'^r'^. 

To find the limiting value of Xlgt = X^PpX/n, use 



Let C = /3 Var(fy)/3^ and H = G^TC^/^. Then it 
follows from Proposition 4 that 

-2Zppc(G) = log |G;^SGo| + log IG^SresGI 

= log |G^s,esGo + G^rcr^Gol 

+ log|G^SresG| 
= log|G^5]resGo| 

+ log|/rf + H^(G^SrcsGo)-^H| 

+ log|G^I],esG| 



model (13) to write 



and thus 

X = - X^) + F/3^r^ + Ro^or^ + TiftT^, 

PpX = Fp^T^ + PpRof^oro + PpR^^r^, 

where Rq is the n x p — d matrix with rows £q and 
R is the n X d matrix with rows e'^ . From this, 
X-^PpX/n contains nine terms. The first of these 
is r/3(F^F/n)r^/3^ ^r/3Var(fy)r^/3^. The re- 
maining terms involve products containing either or 
both of the factors F-^R/n and F-^Ro/n. Recalling 
that the rows (corresponding to samples) of R and 
Rq are independent, these factors both converge to 
and this property forces each of the eight remain- 
ing terms to converge to 0. For instance, the last 
quadratic term can be written m(R^PpR/n)nr^, 
and 

R^PpR/n = (R^F/n)(F^F/n)-(F^R/n), 

which converges to in probability by Slutsky's the- 
orem. 

Since S = Sgt + '^rcsi take the difference of the 
limiting values for S and Sgt to confirm the limiting 
value for Srcs- 

A. 5 Equation (15) 

Let O = (Oi, O2) be a partitioned p x p orthogo- 
nal matrix. Then 



lO^SOl 



= |ofsOi||o^i;o2 

- 01^X01(0^ SOi)-^Of SO2 

< |ofsOi||of'i;o2|. 



In addition. 



> log IG^ErcsGol + log IG^EresGI 

> log I Srcs I 

= \og\nl\\^i^\. 



logirj^^sFol+iogir^SrcsFi 
iog\nl\\ft'^\. 



Therefore -Lppc(G) > L{T) for all G. The minimiz- 
ing argument yields a unique subspace if Lppc(G) > 
Lppc(r) for all G such that G^G = Id, dim(cSG) = d 
and Scr / 5r • 

A. 7 Eigenvectors of Sl~^Sfit and S7esSfit 



^ (l-A)Sfit^ = ASres^ 

^ S-iSfit^ = (A/(l-A))l 

The conclusion follows because Srcs = S — Sgt > 
and A/(l — A) is a strictly monotonic function of A. 
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